Abstract: The conventional control paradigm for a heat pump with a less efficient auxiliary
heating element is to keep its temperature set point constant during the day. This constant
temperature set point ensures that the heat pump operates in its more efficient heat-pump
mode and minimizes the risk of activating the less efficient auxiliary heating element. As
an alternative to a constant set-point strategy, this paper proposes a learning agent for a
thermostat with a set-back strategy. This set-back strategy relaxes the set-point temperature
during convenient moments, e.g. when the occupants are not at home. Finding an optimal
set-back strategy requires solving a sequential decision-making process under uncertainty,
which presents two challenges. A first challenge is that for most residential buildings a
description of the thermal characteristics of the building is unavailable and challenging to
obtain. A second challenge is that the relevant information on the state, i.e. the building
envelope, cannot be measured by the learning agent. In order to overcome these two
challenges, our paper proposes an auto-encoder coupled with a batch reinforcement learning
technique. The proposed approach is validated for two building types with different thermal
characteristics for heating in the winter and cooling in the summer. The simulation results
indicate that the proposed learning agent can reduce the energy consumption by 4-9%
during 100 winter days and by 9-11% during 80 summer days compared to the conventional
constant set-point strategy.

Keywords: Auto-encoder; Batch reinforcement learning; Heat pump; Set-back thermostat.

1. Introduction

Residential and commercial buildings account for about 20-40% of global energy consumption [1]. Half
of this energy is consumed by Heating, Ventilation and Air Conditioning (HVAC) systems. About
two-thirds of these HVAC systems use fossil fuel sources, such as oil, coal and natural gas. Replacing this
large share of fossil-fueled HVAC systems with more energy-efficient heat pumps can play an important
role in reducing greenhouse gases [2-4]. For instance, in [5], Bayer et al. report that replacing fossil
fuel based HVAC systems with electric heat pumps can help reduce greenhouse gases in space heating
by 30-80% in different European countries. The key factors that influence this reduction are the
substituted fuel type, the energy efficiency of the heat pump and the electricity generation mix of the
country.

This paper focuses on residential heat pumps equipped with an auxiliary heating element. This heating
element can be a less efficient electric furnace or a gas- or oil-fired furnace. In its regular operation, a heat
pump runs in its more energy-efficient heat-pump mode. However, when the temperature drops too low,
both the heat pump and the auxiliary heating element are activated. Since most heat pumps are equipped
with an electric auxiliary heating element, which can be four times less efficient, the U.S. Department
of Energy recommends operating the thermostat with a constant target temperature during the day, even
when the inhabitants are not at home [6].

As an alternative to the constant temperature set-point strategy, this paper presents a set-back method,
in which the temperature set point is relaxed during convenient times, for example, during the night
or when the inhabitants are not at home. Such a set-back method can reduce the energy consumption
compared to the constant set-point strategy, provided that it avoids activating the auxiliary
heating [7].

The remainder of this paper is organized as follows. Section 2 gives an overview of existing literature
on heat-pump thermostats and their application to demand response. Section 3 addresses the challenges
of developing a successful set-back strategy to reduce the energy consumption of a heat pump. Section 4
formulates the sequential decision-making problem of a thermostat agent as a stochastic Markov decision
process. Section 5 proposes an approach based on an auto-encoder and fitted Q-iteration. The simulation
results are given in Section 6, and, finally, Section 7 summarizes the general conclusions of this work.

2. Literature review 

Driven by the potential of heat pumps to reduce greenhouse gases, heat-pump thermostats
have attracted the attention of researchers [8,9] and commercial companies [10-14]. A popular
control paradigm in the literature on optimal control of heat-pump thermostats is a model-based
approach. Within this paradigm, a first type of model-based controller uses a model predictive
control approach [2,15]. At each decision step, the controller defines a control action by solving a
fixed-horizon optimization problem, starting from the current time step and using a calibrated model of
its environment. For example, the authors of [9] use a mixed-integer quadratic programming solution
to minimize the electricity cost and carbon output of a home heating system. However, the performance
of these model-based approaches depends on the quality of the model and the availability of expert
knowledge. Model-based approaches can achieve very good results within a reasonable learning period,
but typically they have to be tailored to their application and they have difficulties with stochastic
environments [16]. A second type of model-based controller formulates the control problem as a
Markov decision problem and solves the corresponding problem using techniques from approximate
dynamic programming [17,18]. For example, in [8], Urieli et al. use a linear regression model to fit the
model of the building and then apply a tree-search algorithm for finding an intelligent set-back strategy
for a heat-pump thermostat. Alternatively, in [19], Morel et al. propose an adaptive building controller
that makes use of artificial neural networks and dynamic programming. Similarly, the authors of [20]
propose a combined neuro-fuzzy model for dynamic and automatic regulation of indoor temperature.
They use an artificial neural network to forecast the indoor temperature, which is then used as the input
of a fuzzy logic control unit in order to manage the energy consumption of the HVAC system. In addition,
the authors of [21] report that an artificial neural network based model can adapt to changing building
background conditions, such as the building configuration, without the need for additional intervention
by an expert.

An alternative control paradigm makes use of model-free reinforcement learning techniques in order
to avoid the system identification step of model-based controllers. For example, the authors of [22]
propose a Q-learning approach to minimize the electricity cost of a thermal storage. In [23], Zheng et al.
show how a Q-learning approach can be used for residential and commercial buildings by decomposing
it over different device clusters. However, a main drawback of classic reinforcement learning algorithms,
such as Q-learning and SARSA, is that they discard observations after each interaction with their
environment. In contrast, batch reinforcement learning techniques do not require many interactions
until convergence to obtain reasonable policies [24-26], since they store and reuse past observations.
As a result, they have a shorter learning period, which makes them an attractive technique for real-world
applications, such as a heat-pump thermostat. In both [27] and [28], the authors use a batch reinforcement
learning technique, fitted Q-iteration, in combination with a market-based multi-agent system, in order
to control a cluster of flexible devices, such as electric vehicles and electric water heaters.

This work contributes to the application of batch reinforcement learning to the problem of finding
a successful set-back strategy for a heat-pump thermostat. This problem was previously addressed
by the work of Urieli et al. in [8]. The main difference with their work is that our work proposes a
model-free approach that can intrinsically capture the stochastic nature of the problem. The authors
build on the existing literature on batch reinforcement learning, in particular fitted Q-iteration [24], and
auto-encoders [29].

3. Problem Statement 

The main objective of this paper is to develop a model-free learning agent for a heat pump with an
auxiliary heating element in order to overcome the following two challenges. The first challenge is that
the auxiliary heating element activates when the indoor temperature reaches a predefined temperature
threshold. The operation of the thermostat is given by Algorithm 2 and can be found in Appendix A.
More information on the temperature settings of the thermostat can be found in Table 2 of Appendix A.
In order to illustrate the activation of the auxiliary heating element, two thermostat agents are depicted
in Figure 1. Our set-back strategy relaxes the indoor temperature during working hours, i.e. 7-17h
(Figure 1). It can be seen that the first agent correctly anticipates the comfort bounds at 17h and
begins to heat the building in normal heat-pump operating mode (point A). The second agent postpones
heating until point B and triggers the electric auxiliary heating to switch on at point C. As a result of the
activation of the less efficient auxiliary heating element, the second agent consumes more energy than
the recommended constant temperature set-point strategy.

Figure 1. Indoor temperature of two thermostat agents with a set-back strategy (7-17h).
Agent one operates in normal heat-pump mode and agent two activates the less efficient
auxiliary heating.

A second important challenge when developing an intelligent set-back strategy is that the moment
of activating the heat pump depends not only on the weather conditions, but also on the thermal
characteristics of the building. This challenge is illustrated by a second example, where a successful
set-back strategy is depicted for two building types. Both building types have identical outside
temperatures and inner disturbances. Figure 2a depicts the indoor temperature of a building with a high
insulation level, whereas Figure 2b depicts the indoor temperature of a building with a low insulation
level. It can be seen that the thermal characteristics of the building can have a significant impact on the
operation of the thermostat agent. For instance, the set-back thermostat in Figure 2a can postpone its
heating action until quarter 68, while the set-back thermostat in Figure 2b needs to start heating around
quarter 60 in order to avoid the activation of the auxiliary heating.

4. Markov Decision Process 

Motivated by the challenges presented in Section 3 and driven by recent advances in reinforcement 
learning [24,30,31], our paper introduces a model-free learning agent. In order to use reinforcement 
learning techniques, the sequential decision-making problem of a heat-pump thermostat with set-back 
strategy is formulated as a stochastic Markov decision process [18,32]. 

Figure 2. Indoor temperature of two buildings with different thermal insulation, (a): building
with high thermal insulation level, (b): building with low thermal insulation level.

At every decision step $k$ the thermostat agent chooses a control action $u_k \in U \subset \mathbb{R}$ and the state of its
environment $x_k \in X \subset \mathbb{R}^d$ evolves according to the transition function $f$:

$$x_{k+1} = f(x_k, u_k, w_k) \quad \forall k \in \{1, \ldots, T-1\}, \qquad (1)$$

with $w_k$ a realization of a random process drawn from a conditional probability distribution $p_W(\cdot \mid x_k)$.
After a transition to the next state $x_{k+1}$, the agent receives an immediate cost $c_k$ provided by:

$$c_k = \rho(x_k, u_k, w_k), \qquad (2)$$


where $\rho$ is the cost function. The goal of the thermostat agent is to find a control policy $h^* : X \to U$ that
minimizes the expected $T$-stage return for any state in the state space. The expected $T$-stage return $J_T^{h^*}$
starting from $x_1$ and following $h^*$ is defined as follows:

$$J_T^{h^*}(x_1) = \mathbb{E}\left[\sum_{k=1}^{T} \rho(x_k, h^*(x_k), w_k)\right], \qquad (3)$$


where $\mathbb{E}$ denotes the expectation operator. A more convenient way to characterize the policy $h^*$ is by
using a state-action value function or Q-function:

$$Q^{h^*}(x, u) = \mathbb{E}\left[\rho(x, u, w) + J_T^{h^*}(f(x, u, w))\right]. \qquad (4)$$

The Q-value is the cumulative return starting from state $x$, taking action $u$, and following $h^*$ thereafter.
Starting from a Q-function for every state-action pair, the policy is calculated as follows:

$$h^*(x) \in \arg\min_{u \in U} Q^{h^*}(x, u), \qquad (5)$$
where $h^*$ satisfies the Bellman optimality equation [33]:

$$J_T^{h^*}(x) = \min_{u \in U} \mathbb{E}\left[\rho(x, u, w) + J_T^{h^*}(f(x, u, w))\right]. \qquad (6)$$

The central idea behind batch reinforcement learning is to estimate the state-action value function
based on a set of past observations (or batch) of the state, control action and reward. Note that this
approach does not require a model of the environment $f$ or the disturbances $w$. As a result no system
identification step is needed. The following five subsections give a tailored definition of the state, action,
cost function and transition function of a heat-pump thermostat agent.

4.1. Observable State 

At each time step $k$, the thermostat agent can measure the following state information:

$$x_k = (d, t, T_{\mathrm{in},k}, T_{\mathrm{out},k}, S_k), \qquad (7)$$

where $d \in \{1, \ldots, 7\}$ represents the current day of the week and $t \in \{1, \ldots, 96\}$ the current
quarter-hour of the day. The observable state information related to the physical state of the building is given by a
measurement of the indoor temperature $T_{\mathrm{in},k}$. The observable exogenous state information is defined by
$T_{\mathrm{out},k}$ and $S_k$, which are the outdoor temperature and solar irradiance at time step $k$. Note that by including
the measurements of $T_{\mathrm{out},k}$ and $S_k$ at time step $k$ our approach captures a first-order correlation of these
stochastic variables.

4.2. Thermostat Function 

In order to guarantee the comfort of the end user, the heat pump is equipped with a thermostat
mechanism (Algorithm 2). The thermostat logic maps the requested control action $u_k$ taken in state
$x_k$ to a physical control action $u_k^{\mathrm{ph}}$:

$$u_k^{\mathrm{ph}} = T(x_k, u_k). \qquad (8)$$

As such, the thermostat function $T$ maps the requested control action to a physical quantity, which is
required to calculate the cost value.

4.3. Augmented State 

As previously stated, this paper assumes that the temperature of the building envelope cannot be 
measured. It is important to realize that the temperature of the building envelope contains essential 
information to accurately capture the transient response of the indoor air temperature. Moreover, the 
temperature of the building envelope represents information on the amount of thermal energy stored in 
the thermal mass of the building. A possible strategy is to represent the temperature of the building 
envelope by a handcrafted feature based on expert knowledge, which can be difficult to obtain for 
residential buildings. However, a more generic strategy is to include past observations of the state and
action in the state variable [30,34]:

$$x_{\mathrm{aug},k} = (d, t, T_{\mathrm{in},k}, z_k, T_{\mathrm{out},k}, S_k), \qquad (9)$$

with

$$z_k = (T_{\mathrm{in},k-1}, \ldots, T_{\mathrm{in},k-n}, u^{\mathrm{ph}}_{k-1}, \ldots, u^{\mathrm{ph}}_{k-n}), \qquad (10)$$

where $n$ denotes the number of past observations of the indoor temperatures and physical actions. Note
that the physical control actions have been included in the state, since they give an indication of the
amount of energy added to the system. In the next section, a feature extraction technique is proposed
to mitigate the "curse of dimensionality" [33] and find a compact representation of the augmented state
vector.
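
As an illustration of how the augmented state could be assembled in practice, the following minimal Python sketch keeps the last $n$ indoor temperatures and physical actions in fixed-length buffers. The class and function names (`History`, `augmented_state`) and the most-recent-first ordering are assumptions made for this example and are not taken from the paper.

```python
from collections import deque


class History:
    """Hypothetical helper: stores the last n indoor temperatures and physical actions."""

    def __init__(self, n):
        self.temps = deque(maxlen=n)
        self.actions = deque(maxlen=n)

    def push(self, t_in, u_ph):
        # Most recent observation first; the oldest one drops out automatically
        self.temps.appendleft(t_in)
        self.actions.appendleft(u_ph)

    def z(self):
        # z_k: past indoor temperatures followed by past physical actions
        return tuple(self.temps) + tuple(self.actions)


def augmented_state(d, t, t_in, t_out, s, history):
    """Concatenate time features, current measurements and the history z_k."""
    return (d, t, t_in) + history.z() + (t_out, s)
```

With $n = 10$, as used later in the simulations, the history part of the state has 20 components, which motivates the dimensionality reduction discussed in the next section.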


4.4. Transition Function 

A detailed description of the transition function $f$ that models the temperature dynamics of the
building is given in Appendix B. This paper proposes a model-free approach and makes no assumption
about the model type or its parameters.

4.5. Cost Function 

The cost function $\rho : X \times U \to \mathbb{R}$, associated with a single transition, is given by:

$$c_k = u_k^{\mathrm{ph}} \Delta t + \alpha_k, \qquad (11)$$

where the parameter $\alpha_k$ represents a penalty for violating the comfort constraints and $\Delta t$ represents the
time interval of one control period. When the indoor air temperature $T_{\mathrm{in}}$ is lower than $\underline{T}_s$ or higher than
$\overline{T}_s$, $\alpha_k$ is set to 10^ and otherwise to 0.
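
The cost of a single transition can be sketched in a few lines of Python. This is only an illustration: the function name, the default control period of 0.25 h (one quarter-hour) and, in particular, the penalty value `1e3` are placeholders chosen for the example, not values taken from the paper.

```python
def transition_cost(u_ph, t_in, t_low, t_high, dt=0.25, penalty=1e3):
    """Energy used during one control period plus a comfort penalty.

    u_ph    : physical control action (power) applied during the period
    t_in    : indoor air temperature
    t_low   : lower comfort bound, t_high: upper comfort bound
    dt      : length of one control period (assumed 0.25 h here)
    penalty : placeholder magnitude for a comfort violation
    """
    alpha = penalty if (t_in < t_low or t_in > t_high) else 0.0
    return u_ph * dt + alpha
```

For example, `transition_cost(2.0, 21.0, 20.0, 22.5)` returns the pure energy term, while the same action at an indoor temperature of 19.0 adds the penalty on top.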


5. Model-Free Batch Reinforcement Learning Approach 


Given full knowledge of the transition function, it is in principle possible to find an optimal policy by
solving the Bellman equation (6) for every state-action pair using techniques from approximate dynamic
programming [17,33]. This paper, however, applies a model-free batch reinforcement learning technique,
where the sole information available to solve the problem is the one obtained from daily observations of
the following one-step transitions:

$$\mathcal{F} = \{(x_{\mathrm{aug},l}, u_l, x'_{\mathrm{aug},l}, u_l^{\mathrm{ph}})\}_{l=1}^{\#\mathcal{F}}, \qquad (12)$$

where each tuple is made up of the augmented state $x_{\mathrm{aug},l}$, the control action $u_l$, the physical control action
$u_l^{\mathrm{ph}}$ and the successor state $x'_{\mathrm{aug},l}$. Figure 3 outlines the building blocks of the model-free batch reinforcement
learning method, which consists of two interconnected loops.


5.1. Offline Loop


The offline loop contains a feature extraction technique and a batch reinforcement learning method. 

Figure 3. Online and offline loop of our proposed approach, which consists of a feature
extraction (offline), fitted Q-iteration (offline) and Boltzmann exploration (online).

5.1.1. Feature Extraction

This paper proposes a feature extraction technique to find a low-dimensional representation of the
augmented state $x_{\mathrm{aug},k} = (d, t, T_{\mathrm{in},k}, z_k, T_{\mathrm{out},k}, S_k)$, by reducing the dimensionality of the state information
corresponding to past observations:

$$z_{e,k} = \Phi(z_k, W), \qquad (13)$$

where $\Phi : Z \to Z_e$ is a feature extraction function that maps $z \in Z \subset \mathbb{R}^p$ to the encoded state $z_e \in Z_e \subset
\mathbb{R}^q$, with $q < p$ and where $W$ contains the parameters corresponding to $\Phi$.

This work introduces a feature extraction technique based on an auto-encoder. An auto-encoder or
auto-associative neural network is an artificial neural network with the same number of input and output
neurons and a smaller number of hidden feature neurons. These hidden feature neurons function as a
bottleneck and can be seen as a reduced representation of $z_k$. During training of the auto-encoder the
output data is set to be equal to the input data. The weights of the network are then trained to minimize
the squared error between the inputs and their reconstruction. Different training methods to find $W$ can be
found in the literature [35-37]. However, comparing the performance of these training methods is out of
the scope of this paper. This work uses a hierarchical training strategy that uses a conjugate gradient
descent method [38].
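
To make the bottleneck structure concrete, the sketch below shows the forward pass of a single-hidden-layer auto-encoder in plain Python. It is a structural illustration only: the tanh activation, the absence of bias terms, the random initialization and all function names are assumptions for this example, and no training step (such as the conjugate gradient method mentioned above) is included.

```python
import math
import random


def init_autoencoder(p, q, seed=0):
    """Random (untrained) weights for a p -> q -> p network, with q < p."""
    rng = random.Random(seed)
    w_enc = [[rng.uniform(-0.1, 0.1) for _ in range(p)] for _ in range(q)]
    w_dec = [[rng.uniform(-0.1, 0.1) for _ in range(q)] for _ in range(p)]
    return w_enc, w_dec


def encode(z, w_enc):
    """Feature extraction: map z (length p) to the encoded state (length q)."""
    return [math.tanh(sum(w * x for w, x in zip(row, z))) for row in w_enc]


def decode(z_e, w_dec):
    """Reconstruct an approximation of z from the encoded state."""
    return [sum(w * h for w, h in zip(row, z_e)) for row in w_dec]


def reconstruction_error(z, w_enc, w_dec):
    """Squared error between the input and its reconstruction (the training objective)."""
    z_hat = decode(encode(z, w_enc), w_dec)
    return sum((a - b) ** 2 for a, b in zip(z, z_hat))
```

Training would adjust `w_enc` and `w_dec` to minimize `reconstruction_error` over all history vectors in the batch; afterwards only `encode` is needed to compute the reduced state.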

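The next paragraph explains how a popular batch reinforcement learning technique, i.e. fitted
Q-iteration, can be used, given:

$$\mathcal{F}_R = \{(x^{R}_{\mathrm{aug},l}, u_l, x^{R\prime}_{\mathrm{aug},l}, u_l^{\mathrm{ph}})\}_{l=1}^{\#\mathcal{F}}, \qquad (14)$$

with

$$x^{R}_{\mathrm{aug},l} = (d, t, T_{\mathrm{in},l}, \Phi(z_l, W), T_{\mathrm{out},l}, S_l), \qquad (15)$$

where $x^{R}_{\mathrm{aug},l}$ denotes the reduced augmented state, and $\Phi(z_l, W)$ denotes the encoded state information of
the past observations $z_l$.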

5.1.2. Fitted Q-iteration

Although other batch reinforcement learning techniques can be used, this work contributes to the
application of fitted Q-iteration [24]. Fitted Q-iteration makes efficient use of gathered data and can be
combined with different regression methods. In contrast to standard Q-learning [32], fitted Q-iteration
computes the Q-function offline and makes use of the whole batch. An overview of the fitted Q-iteration
algorithm is given in Algorithm 1. The algorithm iteratively builds a training set with all state-action pairs
in $\mathcal{F}_R$ as the input. The target values consist of the corresponding cost values and the optimal Q-value,
based on the approximation of the previous iteration, for the next state. This work uses an extremely
randomized trees ensemble method [39] to find an approximation $\hat{Q}$ of the Q-function. The ensemble
was set to 60 trees and a minimum of 3 samples for splitting a node. The number of samples selected at
each node was set to the dimension of the input space. More information on the regression method
can be found in [39].

Algorithm 1 Fitted Q-iteration [24]
Input: $\mathcal{F}_R = \{(x^{R}_{\mathrm{aug},l}, u_l, x^{R\prime}_{\mathrm{aug},l}, u_l^{\mathrm{ph}})\}_{l=1}^{\#\mathcal{F}}$
1: Initialize $\hat{Q}_0$ to zero
2: for $N = 1, \ldots, T$ do
3:   for $l = 1, \ldots, \#\mathcal{F}$ do
4:     $c_l \leftarrow \rho(x^{R}_{\mathrm{aug},l}, u_l^{\mathrm{ph}})$
5:     $Q_{N,l} \leftarrow c_l + \min_{u \in U} \hat{Q}_{N-1}(x^{R\prime}_{\mathrm{aug},l}, u)$
6:   end for
7:   use regression to obtain $\hat{Q}_N$ from $\{((x^{R}_{\mathrm{aug},l}, u_l), Q_{N,l}),\ l = 1, \ldots, \#\mathcal{F}\}$
8: end for
Output: $Q^* = \hat{Q}_N$
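
The structure of Algorithm 1 can be illustrated with a small, self-contained Python sketch. Two simplifications are made and should be read as assumptions: the batch stores the cost of each transition directly instead of recomputing it from the physical action, and a table-lookup regressor (averaging targets per state-action pair) stands in for the extremely randomized trees used in this work, so the sketch only applies to small discrete problems. All function names are hypothetical.

```python
from collections import defaultdict


def fit_table(inputs, targets):
    """Stand-in regressor: average the targets per (state, action) key."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, y in zip(inputs, targets):
        sums[key] += y
        counts[key] += 1
    table = {key: sums[key] / counts[key] for key in sums}
    return lambda x, u: table.get((x, u), 0.0)


def fitted_q_iteration(batch, actions, horizon, fit=fit_table):
    """batch: list of (x, u, x_next, cost) tuples; returns the final Q approximation."""
    q = lambda x, u: 0.0  # Q_0 is zero everywhere
    inputs = [(x, u) for x, u, _, _ in batch]
    for _ in range(horizon):
        # Target: immediate cost plus the best Q-value of the successor state
        targets = [c + min(q(x_next, a) for a in actions)
                   for _, _, x_next, c in batch]
        q = fit(inputs, targets)
    return q


def greedy_action(q, x, actions):
    """Policy extraction as in (5): pick the action with the lowest Q-value."""
    return min(actions, key=lambda a: q(x, a))
```

Any regressor with a fit-then-predict interface could be passed as `fit`; a tree ensemble would replace `fit_table` for the continuous states considered here.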


5.2. Online loop 


A Boltzmann exploration strategy [40] is used at each decision step to find the probability of selecting 
an action: 





$$P(u \mid x) = \frac{e^{-Q^*(x,u)/\tau_d}}{\sum_{u' \in U} e^{-Q^*(x,u')/\tau_d}}, \qquad (16)$$


where the parameter $\tau_d$ controls the amount of exploration and $Q^*$ is the Q-function obtained with
Algorithm 1. The parameter $\tau_d$ is decreased during the simulation following a harmonic sequence [17]:


$$\tau_d = \frac{1}{d^{n}}, \qquad (17)$$


where $d$ denotes the current day and $n$ is set to 0.7. Note that as $\tau_d$ approaches 0 the policy becomes greedy
and the best action is chosen.
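
A minimal Python sketch of Boltzmann exploration is given below. It assumes the standard softmax form over negated Q-values (since costs are minimized, lower Q-values should be more likely) and a temperature of the form $1/d^n$ with $n = 0.7$; the function names are chosen for this example only.

```python
import math
import random


def temperature(day, n=0.7):
    """Exploration temperature that decreases with the day index (day >= 1)."""
    return 1.0 / day ** n


def boltzmann_probabilities(q_values, tau):
    """Softmax over negated Q-values: cheaper actions get higher probability."""
    q_min = min(q_values)  # shift by the minimum for numerical stability
    weights = [math.exp(-(q - q_min) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(q_values, day, rng=random):
    """Sample an action index according to the Boltzmann distribution."""
    probs = boltzmann_probabilities(q_values, temperature(day))
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

Early in the simulation the temperature is high and actions are chosen almost uniformly; as the days pass the distribution concentrates on the action with the lowest Q-value.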


6. Simulation Results 

This section compares the performance of our learning agent to a default constant set-point strategy 
and a prescient set-back strategy. 

6.1. Simulation Setup 

The simulations use a second-order equivalent thermal parameter model to calculate the indoor air
temperature and the envelope temperature of the building [41]. The parameters correspond to a building
with a floor area of 200 m² and a window-to-floor ratio of 30%. The model equations and parameters are
presented in Appendix B. The building is equipped with a heat pump to satisfy the heating or cooling
demand of the inhabitants. The heat pump can change its power set point every 15 minutes with 10
discrete heating or cooling actions. This paper considers two comfort settings, i.e. the default strategy
and the set-back strategy. The default strategy has a constant temperature set point of 20.5°C during the
entire day. In contrast, the set-back strategy relaxes the set-point temperature from 7h to 17h when the
inhabitants are not present in the building.

The heat-pump thermostat is equipped with sensors to measure the outside temperature and solar 
irradiation, which are measurements obtained from a location in Belgium [42]. This work assumes that 
the heat-pump thermostat is provided with a forecast of the outside temperature and solar irradiation. 
However, internal heat gains caused by the inhabitants and electrical appliances cannot be measured or 
forecasted and are obtained from [43]. 

6.2. Learning Agent 

The weights of the auto-encoder network are calculated at the beginning of each day and are used
to calculate $\mathcal{F}_R$. The state information corresponding to the past observations consists of the previous
10 indoor temperatures and the previous 10 control actions. The number of hidden neurons of the
auto-encoder network is set to 6. Given the batch $\mathcal{F}_R$, the fitted Q-iteration algorithm constructs a
Q-function for the next day (see Algorithm 1). This Q-function is then used online by the Boltzmann
exploration strategy. Note that each simulation run begins with an empty batch of tuples and that
the observations of the previous day are added to $\mathcal{F}$ daily.

6.3. Prescient Method 

The prescient set-back strategy assumes that the model parameters are known and that it has perfect
knowledge of the outside temperature, solar irradiation, and internal heat gains. A detailed description
of the prescient method can be found in Appendix C. The outcome of the prescient set-back strategy is
used to evaluate the performance of the learning agent and can be seen as an absolute lower bound on
the energy consumption.

6.4. Simulation Results 

The experiments compare the energy consumption and temperature violations of the learning agent
with a set-back strategy, the conventional constant set-point strategy and the prescient set-back strategy. Note
that the conventional constant temperature set point is the strategy recommended by the U.S. Department
of Energy [6]. In order to examine the adaptability of the learning agent, an identical learning agent is
applied to two building types, with a high and a low thermal insulation level. The evaluation is repeated for
100 winter days (heating mode) and 80 summer days (cooling mode). Figure 4 depicts the cumulative
energy consumption of the default controller, prescient controller and learning agent. As can be seen
in Figure 4, the learning agent is able to reduce the total energy consumption compared to the default
strategy for both building types. The simulation results indicate that the learning agent was able to reduce
the energy consumption by 4-9% during the winter and by 7-11% during the summer.

Figure 4. Cumulative energy plot of the default constant set-point strategy, our learning
agent with set-back strategy and prescient set-back strategy. Panels: winter (low thermal
insulation), winter (high thermal insulation), summer (low thermal insulation) and summer
(high thermal insulation); horizontal axis: days.

It should be noted, however, that the total energy consumption does not give a complete picture, as it does not consider the
temperature violations. Remember that a comfort violation in the heating mode results in the activation
of the less efficient auxiliary heating element. However, in the cooling mode no auxiliary cooling is
available. For this reason Figure 5 shows the daily performance metric $M_d$ and the daily deviation $D_d$
between the temperature set point and the indoor temperature at 17h, which is the end of the set-back
period. The daily deviation is calculated as follows:


$$D_d = \max(T_{\mathrm{in},17} - \overline{T}_{s,17}, 0) + \max(\underline{T}_{s,17} - T_{\mathrm{in},17}, 0), \qquad (18)$$

where $T_{\mathrm{in},17}$ is the indoor temperature at 17h, and where $\underline{T}_{s,17}$ and $\overline{T}_{s,17}$ are the minimum and maximum
temperature set point at 17h. The daily performance metric $M_d$ is calculated as follows:

$$M_d = \frac{e_l - e_d}{e_p - e_d}, \qquad (19)$$

where $e_l$ denotes the daily energy consumption of the learning agent, $e_d$ denotes the daily energy
consumption of the default strategy and $e_p$ denotes the daily energy consumption of the prescient
controller. As such, the metric $M$ corresponds to 0 if the learning agent obtains the same performance
as the default strategy and corresponds to 1 if the learning agent obtains the same performance as the
prescient controller. Figure 5 shows that the comfort violations decrease over the simulation horizon.
At the same time the performance metric $M$ increases.
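
Both evaluation quantities are simple enough to state as code. The following Python sketch mirrors the definitions of the daily deviation and the daily performance metric; the function and argument names are chosen for this example, and the sketch assumes that the prescient and default strategies do not consume exactly the same energy, since the metric is undefined in that case.

```python
def daily_deviation(t_in_17, t_low_17, t_high_17):
    """Distance of the 17h indoor temperature from the comfort band (0 inside the band)."""
    return max(t_in_17 - t_high_17, 0.0) + max(t_low_17 - t_in_17, 0.0)


def performance_metric(e_learning, e_default, e_prescient):
    """0 when the agent matches the default strategy, 1 when it matches the prescient one."""
    return (e_learning - e_default) / (e_prescient - e_default)
```

For instance, an agent that closes half of the gap between the default and the prescient consumption obtains a metric of 0.5.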

Figure 5. The performance metric M and the temperature violations D for the learning agent
with set-back strategy.

The results obtained with a mature controller (batch size of 30 days) are depicted in Figure 6.
Figure 6a and Figure 6b depict the indoor temperature and power consumption profile during 7 winter
days. Similarly, Figure 6c and Figure 6d depict the indoor temperature and power consumption during
7 summer days. The simulation results indicate that the proposed learning agent can adapt itself to
different building types and outside temperatures. In addition, the learning agent with set-back strategy
can reduce the energy consumption of a heat pump compared to the conventional constant temperature
set-point strategy.


7. Conclusion and Future Work 

This work addressed the challenge of developing a learning agent for a heat pump with a set-back 
strategy that saves energy compared to a constant temperature set-point strategy, which is recommended 
by the U.S. Department of Energy. To this end, this paper proposed an approach based on an 
auto-encoder and a popular model-free batch reinforcement learning technique, i.e. fitted Q-iteration. 
The auto-encoder is used to reduce the dimensionality of the state vector, which contains past 
observations of the indoor temperatures and energy consumptions. The performance of the set-back 
strategy has been evaluated for heating in the winter and cooling in the summer for two building types 
with different thermal characteristics. An equivalent thermal parameter model has been used to simulate 
the temperature dynamics of the indoor air temperature and the temperature of the building envelope. 

Figure 6. Indoor temperature and power consumption of the learning agent with set-back
strategy during seven winter (a,b) and summer days (c,d). Horizontal axis: time [days].

During the winter period, the set-back strategy was able to reduce the energy consumption by 4-9%
compared to the default strategy. During the summer period, the set-back strategy saved 7-11% compared
to the default strategy. The results indicate that the proposed learning agent can adapt itself to different
building types and weather conditions. The proposed learning agent obtained these results without
making assumptions about the model or its parameters. As a result, the learning agent can be applied
to virtually any building type.

With this work, we intended to show that model-free batch reinforcement learning techniques can 
provide a valuable alternative to model-based controllers. In our future work, we plan to focus on 
including real-time electricity prices in the objective and on implementing the presented approach in a 
lab environment. 
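Algorithm 2 Thermostat of a heat pump with auxiliary heating
1: Measure $T_{\mathrm{in}}$
2: Initialize $\underline{T}_s$, $\underline{T}_b$, $\overline{T}_b$, $\overline{T}_s$
3: if $T_{\mathrm{in}} < \underline{T}_s - \underline{T}_b$
     $u^{\mathrm{ph}} = P_h + P_{\mathrm{aux}}$ until $T_{\mathrm{in}} > \underline{T}_s + \overline{T}_b$
4: elseif $\underline{T}_s - \underline{T}_b < T_{\mathrm{in}} < \underline{T}_s + \overline{T}_b$
     $u^{\mathrm{ph}} = P_h$
5: elseif $\underline{T}_s + \overline{T}_b < T_{\mathrm{in}} < \overline{T}_s$
     $u^{\mathrm{ph}} = u$
6: elseif $T_{\mathrm{in}} > \overline{T}_s$
     $u^{\mathrm{ph}} = P_c$ until $T_{\mathrm{in}} < \overline{T}_s - \overline{T}_b$
7: end if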

This paper is part of the doctoral research of Frederik Ruelens, supervised by Ronnie Belmans. All
authors have been involved in the preparation of the manuscript.
Appendix A: Thermostat Logic

Algorithm 2 illustrates the working of the heat-pump thermostat. The lower and upper temperature
set points are given by $\underline{T}_s$ (20 °C) and $\overline{T}_s$ (22.5 °C). When the indoor temperature $T_{\mathrm{in}}$ drops below
$\underline{T}_s - \underline{T}_b$ (18.5 °C) the auxiliary heating element is activated in addition to the heat pump until the indoor
temperature reaches $\underline{T}_s + \overline{T}_b$ (20.5 °C). The activation of the auxiliary heating is independent of the
requested control action and can be seen as an overrule mechanism that guarantees the comfort of the
end user. If the indoor temperature $T_{\mathrm{in}}$ is between $\underline{T}_s + \overline{T}_b$ and $\overline{T}_s$ the thermostat controller follows the
requested control action. When the indoor temperature $T_{\mathrm{in}}$ rises above $\overline{T}_s$ the cooling mode is activated
until $\overline{T}_s - \overline{T}_b$ (19.5 °C). In cooling mode no auxiliary cooling element is used.
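
The override behaviour described above can be sketched as a small state machine in Python. The temperature thresholds are the values quoted in this appendix (18.5, 20.5, 22.5 and 19.5 °C); the power levels in `DEFAULTS`, the mode names and the function signature are hypothetical and only serve the illustration.

```python
# Thresholds from the text (°C); power levels are placeholder values
DEFAULTS = {"aux_on": 18.5, "aux_off": 20.5, "cool_on": 22.5, "cool_off": 19.5,
            "p_heat": 2.0, "p_aux": 4.0, "p_cool": 2.0}


def thermostat(t_in, u_requested, mode="normal", p=DEFAULTS):
    """Map a requested action to a physical action; returns (u_ph, new_mode)."""
    # Leave an override mode once its release temperature is reached
    if mode == "backup" and t_in > p["aux_off"]:
        mode = "normal"
    elif mode == "cooling" and t_in < p["cool_off"]:
        mode = "normal"
    # Enter an override mode when a comfort bound is crossed
    if t_in < p["aux_on"]:
        mode = "backup"
    elif t_in > p["cool_on"]:
        mode = "cooling"
    if mode == "backup":
        return p["p_heat"] + p["p_aux"], mode  # heat pump plus auxiliary element
    if mode == "cooling":
        return p["p_cool"], mode
    if t_in < p["aux_off"]:
        return p["p_heat"], mode  # heat pump forced on, requested action ignored
    return u_requested, mode  # inside the band: follow the agent
```

The returned mode has to be fed back into the next call so that the "until" conditions of Algorithm 2 are respected across control periods.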


Appendix B: Model Equations 

In order to obtain system trajectories of the indoor air temperature, an equivalent thermal parameter 
model is used to calculate the indoor air temperature Tin and envelope temperature Tm of a residential 
building [41,44]: 


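$$\begin{aligned}
\dot{T}_{\mathrm{in}} &= \frac{1}{C_a}\left(T_m H_m - T_{\mathrm{in}}(U_a + H_m) + Q_i + T_{\mathrm{out}} U_a\right), \\
\dot{T}_m &= \frac{1}{C_m}\left(H_m (T_{\mathrm{in}} - T_m) + Q_m\right),
\end{aligned} \qquad (20)$$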


where $U_a$ is the thermal conductance of the building envelope, $H_m$ is the thermal conductance between
air and mass, $C_a$ is the thermal mass of the air, and $C_m$ is the thermal mass of the building and its contents.
The heat added to the interior air mass $Q_i$ is given by a fraction $\alpha$ of the internal heat gains $Q_g$, a fraction
$\beta$ of the solar heat gains $Q_s$, and the heat gains generated by the heat pump $Q_h$. The heat added to the
interior solid mass $Q_m$ is given by the other fractions of $Q_s$ and $Q_g$.

$$Q_i = \alpha Q_g + \beta Q_s + Q_h \qquad (21)$$

Table 1. Lumped Parameters

Parameter   Low thermal insulation   High thermal insulation   Unit
U_a         1154                     212                       W/°C
H_m         6863                     6863                      W/°C
C_a         2.441                    2.441                     MJ/°C
C_m         9.896                    9.896                     MJ/°C

Tables 1 and 2 give the parameters used in the simulations. The outside temperature $T_{\mathrm{out}}$ and solar
irradiance were obtained from a location in Belgium [42]. Although more detailed building models exist
in the literature [45], the authors believe that the used model is accurate enough to illustrate the working
of the proposed model-free approach and at the same time flexible enough in terms of parameters and
computational speed.
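
For readers who want to experiment with the building model, the sketch below integrates the two-state equivalent thermal parameter model with the lumped parameters of Table 1. It is an illustration under stated assumptions: explicit Euler integration with 60-second substeps is chosen here for simplicity (a single 900-second Euler step would be numerically unstable for these parameters), the heat gains are passed in directly in watts, and the function names are not taken from the paper.

```python
# Table 1 parameters: (U_a [W/°C], H_m [W/°C], C_a [J/°C], C_m [J/°C])
LOW_INSULATION = (1154.0, 6863.0, 2.441e6, 9.896e6)
HIGH_INSULATION = (212.0, 6863.0, 2.441e6, 9.896e6)


def etp_step(t_in, t_m, t_out, q_i, q_m, params, dt):
    """One explicit-Euler step of the air and mass temperatures; dt in seconds."""
    u_a, h_m, c_a, c_m = params
    d_t_in = (t_m * h_m - t_in * (u_a + h_m) + q_i + t_out * u_a) / c_a
    d_t_m = (h_m * (t_in - t_m) + q_m) / c_m
    return t_in + dt * d_t_in, t_m + dt * d_t_m


def simulate_quarter(t_in, t_m, t_out, q_i, q_m, params, substeps=15):
    """Advance the model by one 15-minute control period using small substeps."""
    dt = 900.0 / substeps
    for _ in range(substeps):
        t_in, t_m = etp_step(t_in, t_m, t_out, q_i, q_m, params, dt)
    return t_in, t_m
```

Running both parameter sets from the same initial temperatures without heat gains shows the poorly insulated building cooling down markedly faster, which is the effect illustrated in Figure 2.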

Appendix C: Prescient Method 

The optimization problem of the prescient method is formulated as follows:

$$\begin{aligned}
\text{minimize} \quad & \sum_{k=1}^{T} c_k \\
\text{subject to} \quad & f(x_k, u_k, w_k) = x_{k+1}, \\
& u_k^{\mathrm{ph}} = T(x_k, u_k),
\end{aligned} \qquad (22)$$

where the plant model $f$ is defined by (20) and the thermostat $T$ by Algorithm 2. In contrast to our model-free
approach, the prescient method knows the plant model and future disturbances. An optimal solution
of this optimization problem was found by applying a mixed-integer linear programming solver using
Gurobi [46] and YALMIP [47].